Exercise may delay cognitive decline in Chinese older adults: a causal inference for ordered multi-categorical exposures with a Mendelian randomization approach

Cognitive problems are prominent in the context of global aging, and the traditional Mendelian randomization method is not applicable to ordered multi-categorical exposures. We therefore aimed to develop a method that addresses this limitation and to use it to investigate causal associations between lifestyle factors and cognition. The study sample was derived from the Chinese Longitudinal Healthy Longevity Survey and included 897 older adults aged 65+. This study used genome-wide association analysis to screen genetic loci as instrumental variables and combined them with maximum likelihood estimation to infer causal associations between ordered multi-categorical exposures (diet, exercise, etc.) and a continuous outcome (cognitive level). The causal inference method for ordered multi-categorical exposures developed in this study was simple and easy to implement, and it reliably detected potential causal associations between variables. Using this method, we found a potential positive causal association between exercise status and cognitive level in Chinese older adults (β̂ = 1.883, 95% CI 0.182–3.512), with no evidence of horizontal pleiotropy (p = 0.370). The study provides a causal inference method applicable to ordered multi-categorical exposures that addresses the limitations of the traditional Mendelian randomization method.

Mendelian randomization (MR) uses genetic variation as the instrumental variable (IV) to make causal inferences about the effect of an exposure on an outcome [6][7][8]. However, this method has limitations when the exposure is an ordered multi-categorical variable. Briefly, if an observed multi-level categorical exposure (e.g., intensity of exercise) is the manifestation of an underlying continuous exposure (e.g., motivation to exercise) passing certain cut-off points, the observed category may remain stable even when the latent exposure has changed, because the latter has not crossed one of the cut-offs. This may violate the exclusivity assumption of the instrumental variable [9][10][11]. In this scenario, disregarding the relationship between a latent, invisible continuous exposure and its external, visible categorical manifestation will bias the MR/IV estimate of effect size (if the latter is a strong stepwise mediator), or even confound the claim of causal association (if the latter is mostly a surrogate) 12.
In view of the above issues, we aimed to develop and test an analytical pipeline of Mendelian randomization tailored to ordered multi-categorical exposures, and to further explore potential causal associations between cognitive level and its influencing factors in Chinese older adults.

Data sources
Data for this study were obtained from the Chinese Longitudinal Healthy Longevity Survey (CLHLS) database (1998-2018) 13,14. This was the largest and most extensive cohort of older adults in China, encompassing more than 40,000 older adults in total. The project mainly collected information on the socio-demographic characteristics, lifestyles, and health status of participants through a self-reported questionnaire. A total of 908 surviving older adults aged 65 years or older (based on their age when they first participated in the survey) were selected from individuals who participated in both the questionnaire and whole genome sequencing (WGS), and their baseline information was collated. After quality control (QC) 15 of the genetic data, a total of 897 older adults aged 65-110 years were included (no missing data), with a male to female ratio of approximately 1:2.74.

Cognition
Individual cognitive level was assessed with the Mini-Mental State Examination (MMSE). This scale provides a comprehensive assessment of the individual's general ability, reactivity, attention and calculation, recall, language, comprehension and self-coordination. The total score is 30, with higher scores indicating better cognitive status 16.

Living habits
Based on the results of our previous studies conducted on CLHLS 4,5, we selected factors that had some association with individual cognitive level for inclusion in this study. The variables in this section included individual drinking status, exercise status, dietary habits, and activity participation at baseline. Drinking status was categorized as drinking (including abstainers) or never drinking. Exercise status was similarly divided into two categories, exercise and never exercise. Diet and activity participation were recorded in terms of frequency, in descending order of "frequently", "occasionally" and "rarely/never", with higher values representing lower frequency. Other variables and detailed descriptions are given in Additional file 1.

Gene
Genetic data were obtained from the Longevity Study (CLHLS) conducted by Prof. Yi Zeng 17. The data were generated from whole genome sequencing and genotyped with Illumina HumanOmniZhongHua-8 BeadChips. The chip was created by selecting optimally tagged single nucleotide polymorphisms (SNPs) from all three phases of the International HapMap Project as well as from the 1000 Genomes Project. It covered 900 015 SNPs, including 600 000 common variants (minor allele frequency, MAF ≥ 5%), 290 000 rare variants (MAF < 5%), and 10 000 SNPs present only in Chinese and other Asian populations. In addition to directly sequenced SNPs, the project also used IMPUTE software (version 2) to infer genotypes for SNPs with MAF ≥ 0.01 (with the 1000 Genomes Project as a reference), representing approximately 85.38% 17.
We performed quality control on the raw genetic data and examined the population stratification (multidimensional scaling, MDS) 15. A total of 3 240 266 SNPs were included in this study, with genotype data covering autosomes 1-22. The genotype of each SNP was recorded in the form of an additive model (i.e., 0, 1, 2). Chromosome location information was obtained from hg19.

Genome-wide association study
Genome-wide association study (GWAS) 15 was conducted with PLINK 1.9 for individual cognitive level, living habits and potential confounders (e.g., educational status and stroke/CVD 5). Continuous variables were analyzed using linear regression, while categorical variables were analyzed using logistic regression. The covariates included were age, sex, ethnicity, and 10 dimensions representing stratified characteristics of the population obtained by multidimensional scaling. Subsequently, the quantile-quantile plot (Q-Q plot) and Manhattan plot were drawn based on the GWAS results.

Instrumental variables
Instrumental variables (IVs) were selected from independent loci that were significant in GWAS. To better validate the generalizability and stability of our method, we relaxed the p-value selection criterion to "close to significance" when selecting instrumental variables. Based on the GWAS results, independent and significant SNPs (p < 1 × 10⁻⁵) satisfying the three assumptions of instrumental variables 8 were selected as IVs for each exposure, after screening to exclude loci that were associated with known confounders.
Specifically, the sub-routine groups significant SNPs into genome segments, that is, clumps, that are independently and significantly associated with the outcome, each indexed by one of the highly significant SNPs within it. Each new clump is isolated iteratively by (1) greedily searching for the next most significant SNP with a p-value no larger than 1e-5 (via --clump-p1 0.00001) that is not yet enclosed in any existing clump, that is, the index variant of the new clump, and then (2) adding variants in close proximity to the index, in terms of both physical distance (±500 kb, via --clump-kb 500) and linkage disequilibrium (r² > 0.2, via --clump-r2 0.2). The index SNPs are then taken as independent loci for an outcome of interest 18.
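The greedy clumping procedure described above can be sketched as follows. This is a simplified illustration in Python, not PLINK's implementation: the function name `greedy_clump`, the SNP record fields and the `r2` lookup callable are hypothetical, and pairwise linkage disequilibrium is assumed to be supplied by the caller.

```python
def greedy_clump(snps, r2, p1=1e-5, kb=500, r2_min=0.2):
    """Group SNPs into clumps, each indexed by its most significant SNP.

    snps : list of dicts with keys 'id', 'chrom', 'pos' (base pairs), 'p'
    r2   : callable (id_a, id_b) -> squared correlation between two SNPs
    Returns a dict mapping each index SNP id to the ids in its clump.
    """
    assigned = set()
    clumps = {}
    # Visit candidate index variants from most to least significant.
    for index in sorted(snps, key=lambda s: s["p"]):
        if index["p"] > p1:
            break  # no remaining SNP passes the index threshold
        if index["id"] in assigned:
            continue  # already enclosed in an earlier clump
        members = [index["id"]]
        assigned.add(index["id"])
        for other in snps:
            if other["id"] in assigned:
                continue
            near = (other["chrom"] == index["chrom"]
                    and abs(other["pos"] - index["pos"]) <= kb * 1000)
            if near and r2(index["id"], other["id"]) > r2_min:
                members.append(other["id"])
                assigned.add(other["id"])
        clumps[index["id"]] = members
    return clumps
```

The keys of the returned dictionary play the role of the index SNPs that are carried forward as candidate independent loci.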

Causal inference for ordered multi-categorical exposure
To better fit the CLHLS data while addressing the limitations of traditional Mendelian randomization, this study presents a simple and reliable causal inference method. An ordered multi-categorical variable (exposure) can be regarded as an unknown continuous latent variable divided at the corresponding thresholds. The resulting model for causal inference with instrumental variables is shown in Fig. 1. In the structural equations relating the variables, (G, X, Y) are the observable data. Without loss of generality, we assume that U ∼ N(0, σ²) and that G, U, ε₁ and ε₂ are mutually independent. We want to estimate the causal effect (β) of X* on Y. Combining the structural equations expresses Y as a function of G in which G enters with coefficient θβ. Because G, U, ε₁ and ε₂ are independent of each other, the parameter θβ can be identified: Y is regressed on G, and the regression coefficient is the estimate of θβ. If a non-zero θ can be identified, the parameter β can be identified.
Figure 1. Directed acyclic graph (DAG) of instrumental variables in the causal inference of ordered multi-categorical variables. "G" denotes instrumental variables (genetic variation), "X" indicates exposure, "X*" denotes latent variables, "Y" and "U" represent outcome and confounders, respectively.
In Eqs. (8)-(10), which give the conditional probability of each exposure level given G, Φ is the cumulative distribution function of the standard normal distribution. It is clear from these equations that the parameter θ is not identifiable on its own. Transforming Eq. (8), however, shows that θ/σ can be identified. If one really wants to estimate the causal effect, one could try to estimate σ² from a separate study. σ² is connected to the heritability of the continuous latent exposure X*. According to Eq. (1) and the definition of σ², V_X* = V_G + σ², where V_X* is the remaining variance of X* after accounting for non-confounding covariates (e.g., age, sex, and the first few genotype principal components), and V_G is the variation in X* attributed to the genotype. If one knows the heritability, one could estimate V_G; then σ², the part of the variation in the latent variable X* not attributed to genotype, could be estimated. One would thus be able to estimate the true effect θ of the MR/IV model and, lastly, the true effect of X* on the outcome Y.
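To make the identifiability argument concrete, one threshold model consistent with the description above is sketched here for a three-level exposure. The notation is an assumption introduced for illustration and is not taken from the original equations: e denotes the whole non-genetic part of X* and is assumed normal with variance σ², and c_1 < c_2 are the unknown thresholds.

```latex
% Illustrative sketch only; e, c_1, c_2 and k are assumed notation.
\begin{align*}
X^{*} &= \theta G + e, \qquad e \mid G \sim N(0, \sigma^{2}) \\
X &= \begin{cases}
  0, & X^{*} \le c_{1} \\
  1, & c_{1} < X^{*} \le c_{2} \\
  2, & X^{*} > c_{2}
\end{cases} \\
P(X = 0 \mid G = g) &= \Phi\!\left(\frac{c_{1} - \theta g}{\sigma}\right) \\
P(X = 1 \mid G = g) &= \Phi\!\left(\frac{c_{2} - \theta g}{\sigma}\right)
                     - \Phi\!\left(\frac{c_{1} - \theta g}{\sigma}\right) \\
P(X = 2 \mid G = g) &= 1 - \Phi\!\left(\frac{c_{2} - \theta g}{\sigma}\right) \\
(\theta, c_{1}, c_{2}, \sigma) &\mapsto (k\theta, k c_{1}, k c_{2}, k\sigma),\ k > 0
  \quad \text{leaves all three probabilities unchanged.}
\end{align*}
```

Under this sketch the observed data (G, X) depend on the parameters only through θ/σ, c_1/σ and c_2/σ, which is why θ/σ rather than θ is the identifiable quantity.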
To combine all observations, we used maximum likelihood estimation (which produces smaller variance and mean squared error 19) for the calculation. The log-likelihood function is given as:
$$\ell = \sum_{i=1}^{n} \log P(X_i = x_i \mid G_i = g_i)$$
where P(X_i = x_i | G_i = g_i) is defined by Eqs. (8-10) and generalizes easily to exposures with more than 3 levels.
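A minimal sketch of this likelihood and its maximization, for a single instrument, is given below. It assumes a normal latent residual, works directly with the identifiable scaled parameters (a = θ/σ and thresholds divided by σ), and replaces whatever optimizer the study used with a simple coordinate pattern search. The names `category_probs`, `log_likelihood` and `fit_mle` are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def category_probs(g, a, cuts):
    """P(X = k | G = g) for k = 0..len(cuts); a = theta/sigma, cuts = c_j/sigma."""
    cum = [0.0] + [PHI(c - a * g) for c in cuts] + [1.0]
    return [cum[k + 1] - cum[k] for k in range(len(cuts) + 1)]


def log_likelihood(counts, a, cuts):
    """counts maps (genotype, exposure level) to the number of subjects in that cell."""
    if any(hi <= lo for lo, hi in zip(cuts, cuts[1:])):
        return -math.inf  # thresholds must be strictly increasing
    total = 0.0
    for (g, x), n in counts.items():
        p = category_probs(g, a, cuts)[x]
        if p <= 0.0:
            return -math.inf
        total += n * math.log(p)
    return total


def fit_mle(counts, n_levels=3, step=0.5, tol=1e-7):
    """Maximize the log-likelihood with a coordinate pattern search."""
    # Parameter vector: [a, cut_1, ..., cut_{n_levels-1}], cuts centered on 0.
    params = [0.0] + [j - (n_levels - 2) / 2 for j in range(n_levels - 1)]
    best = log_likelihood(counts, params[0], params[1:])
    while step > tol:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params[:]
                trial[i] += delta
                value = log_likelihood(counts, trial[0], trial[1:])
                if value > best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2  # refine the search once no axis move helps
    return params[0], params[1:]
```

Because the likelihood depends on the data only through the genotype-by-level cell counts, each evaluation is cheap, so the slow but dependable search is adequate for illustration. With several instruments, a would become a vector and a general-purpose optimizer would be the more natural choice.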
The above derivation still holds if there are multiple instrumental variables (i.e., G is a vector). We can first use the observed data (G, X) to obtain the estimate θ̂ of θ by maximum likelihood estimation, and then regress Y on θ̂G (as in two-stage regression). This process yields the estimate β̂ of the causal effect (β). Because the estimate of θ in this method (i.e., the estimate of θ/σ) differs from the true value by a constant multiple (σ), the result of this two-stage estimation also differs from the true value by a constant multiple (σ). If this estimate is significantly non-zero, then the true causal effect is significantly non-zero. Therefore, this method can be used to infer whether ordered multi-categorical exposures (commonly found in questionnaires) have a significant causal effect on the outcome, provided that IVs can be found from genetic data. In this study, the confidence interval (CI) of the causal effect was estimated by the bootstrap (the "boot" package of R software).
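The second stage and the bootstrap intervals can be illustrated with the sketch below, for a single instrument. Since the slope of Y on G is θβ, regressing Y on (θ/σ)G gives a slope of σβ, the constant-multiple relationship described above. The sketch is in Python rather than R, and the interval formulas follow textbook definitions that may differ in detail from the "boot" package. All function names, the simulated data model and the fixed first-stage coefficient `a_hat` are assumptions; a full analysis would re-estimate the first stage within each bootstrap resample.

```python
import random
from statistics import NormalDist, mean, stdev


def ols_slope(x, y):
    """Slope of the least-squares regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def two_stage_estimate(rows, a_hat):
    """rows: (genotype, outcome) pairs; a_hat: first-stage estimate of theta/sigma."""
    score = [a_hat * g for g, _ in rows]
    outcome = [y for _, y in rows]
    return ols_slope(score, outcome)


def bootstrap_cis(rows, estimator, n_boot=500, level=0.95, seed=0):
    """Normal, basic and percentile bootstrap intervals for estimator(rows)."""
    rng = random.Random(seed)
    n = len(rows)
    t0 = estimator(rows)
    reps = sorted(
        estimator([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    alpha = 1.0 - level
    lo = reps[int((alpha / 2) * (n_boot - 1))]
    hi = reps[int((1 - alpha / 2) * (n_boot - 1))]
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = stdev(reps)
    return {
        "estimate": t0,
        "normal": (t0 - z * se, t0 + z * se),
        "basic": (2 * t0 - hi, 2 * t0 - lo),
        "percentile": (lo, hi),
    }


def simulate(n, theta, beta, sigma, maf, seed):
    """Toy data: U ~ N(0, sigma^2) affects both the latent exposure and Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = (rng.random() < maf) + (rng.random() < maf)  # additive coding 0/1/2
        u = rng.gauss(0, sigma)
        x_star = theta * g + u
        y = beta * x_star + u + rng.gauss(0, 1)
        rows.append((g, y))
    return rows
```

For example, `bootstrap_cis(simulate(1000, 1.5, 1.0, 1.5, 0.3, seed=1), lambda r: two_stage_estimate(r, 1.0))` returns the point estimate together with the three intervals. In this toy setting the point estimate targets σβ rather than β.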
We conducted simulations using ten randomly generated data sets (n = 1000; in Groups A-G, Y was correlated with X*, and in the last three groups it was uncorrelated) combined with repeated sampling (500 times per group). The results supported the validity and reliability of this method for testing potential causal associations. The simulation results are shown in Table 1.
Furthermore, the factors (X) identified by previous studies as significantly associated with cognitive level (Y) were used as exposures and analyzed with our causal inference method. The exposures were drinking status; the intake frequency of fish, fruit, garlic, legume, meat, sugar, and vegetable; exercise status; and participation in activities such as housework, mahjong, open-air activities, pet ownership, reading, and television/radio.
For the statistically significant results among them, we used the MR-Egger method to test for potential horizontal pleiotropy.

Table 1. Simulation of causal inference for ordered multi-categorical exposure. "a" represents the number of instrumental variables, "Normal" means the CI was calculated by the normal approximation method, "Basic" indicates the CI was calculated by the basic bootstrap method and "Percentile" indicates the CI was calculated by the bootstrap percentile method. The ten data sets were randomly generated with a preset σ of 1.5.

The IVs included for each exposure were the same as the independent loci screened above, as detailed in Additional file 2.

Simulation
After multiple replicate calculations (500 replicates for each exposure), the results showed a positive causal association between individuals' exercise status and their cognitive level (β̂ = 1.883, 95% CI 0.182-3.512). In contrast, no causal association was found between diet or activity participation and the cognitive level of older adults. The specific results are shown in Table 2.
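Table 2 notes that its p values were derived in reverse from the CI through a normal distribution. The exact procedure is not spelled out, so the sketch below shows one common way to do this, under the assumption that the interval is treated as symmetric on the normal scale. The function name `p_from_ci` is hypothetical.

```python
from statistics import NormalDist


def p_from_ci(estimate, lower, upper, level=0.95):
    """Two-sided p value implied by a normal-theory confidence interval."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    se = (upper - lower) / (2 * z_crit)             # CI width -> standard error
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Exercise status: estimate 1.883 with 95% CI 0.182-3.512
p_exercise = p_from_ci(1.883, 0.182, 3.512)
```

Because the interval for exercise excludes zero, the implied p value falls below 0.05, in line with the reported significance.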
For the statistically significant result, a further sensitivity analysis found no evidence of horizontal pleiotropy between cognitive level and exercise status (p > 0.05, Table 3). This supported the validity and reliability of the causal effect described above.
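The idea behind the MR-Egger pleiotropy test can be sketched as a weighted regression of SNP-outcome effects on SNP-exposure effects, where an intercept far from zero suggests directional horizontal pleiotropy. This is a simplified sketch, not the software used in the study. The function name and inputs are hypothetical, the p value uses a normal approximation (implementations often use a t distribution), and the residual variance is bounded below by 1 as a modeling assumption.

```python
import math
from statistics import NormalDist


def mr_egger_intercept(beta_x, beta_y, se_y):
    """Inverse-variance weighted regression of beta_y on beta_x with intercept.

    beta_x : per-SNP effects on the exposure
    beta_y : per-SNP effects on the outcome
    se_y   : standard errors of beta_y
    Returns (intercept, slope, se_intercept, p_intercept).
    """
    k = len(beta_x)
    if k < 3:
        raise ValueError("at least three SNPs are needed")
    # Orient every SNP so that its effect on the exposure is positive.
    bx = [abs(b) for b in beta_x]
    by = [y if x >= 0 else -y for x, y in zip(beta_x, beta_y)]
    w = [1.0 / s ** 2 for s in se_y]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, bx)) / sw
    my = sum(wi * yi for wi, yi in zip(w, by)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, bx))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, bx, by))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum(wi * (yi - intercept - slope * xi) ** 2
              for wi, xi, yi in zip(w, bx, by))
    scale = max(1.0, rss / (k - 2))  # residual variance, bounded below by 1
    se_intercept = math.sqrt(scale * (1.0 / sw + mx ** 2 / sxx))
    z = intercept / se_intercept
    p_intercept = 2 * (1 - NormalDist().cdf(abs(z)))
    return intercept, slope, se_intercept, p_intercept
```

A large p value for the intercept, as reported in Table 3, is read as no evidence of directional pleiotropy. It does not prove that pleiotropy is absent.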

Discussion
Currently, Mendelian randomization is the most commonly accepted method of causal inference in the medical field. Traditional MR, however, has a relatively limited field of application and usually prefers causal inference in which both exposure and outcome are continuous variables. If categorical variables were directly analyzed as exposures, the estimated causal effects would be inaccurate 10,20. Although the methodology in this field has been expanded somewhat in recent years, causal inference involving categorical variables has focused more on cases where the outcome is a categorical variable [21][22][23], and studies of categorical exposures are still lacking.
Causal inference with categorical exposures, such as binary categories, is highly sensitive to the choice of thresholds. Since categorical exposures typically capture only effects associated with boundary points and category changes, and do not adequately capture subtle changes in the original continuous exposure that fail to change the category, true causal effects are often difficult to explore. In addition, the causal estimates calculated by MR can sometimes be difficult to interpret. Therefore, it has been suggested that MR should focus more on testing for potential causal effects than on trying to compute estimates of causal effects 24,25.
The method in this study treats ordered multi-categorical variables as continuous variables divided at specific boundary points, and on this basis, the causal association between ordered multi-categorical exposures and continuous outcomes was inferred through maximum likelihood estimation. This method can easily and effectively explore the causal associations between exposures and outcomes. It also adds a new theoretical basis to the field and provides evidence to support the development of subsequent methods. In addition, this method estimated causal effects that differed from the true effects by a constant multiple. This situation was consistent with the idea mentioned in previous studies that using coarsened measures (categorical exposures) as exposures for estimation leads to bias, which would amplify or reduce their effect estimates without inverting their sign 10.

Table 2. Causal inference results for ordered multi-categorical exposure. "*" represents that the variable was reverse coded, with higher values indicating lower frequency, as described in detail in Additional file 1. "a" represents the number of instrumental variables, "b" represents the number of samples for each exposure (excluding individuals with missing information on the corresponding IVs), and "c" indicates that the p value is reverse derived from CI through a normal distribution.
In addition, the study identified a potential causal association between exercise status and individual cognitive level through our method. The results suggested that exercise could delay cognitive decline. A recent international study that analyzed the brains of hundreds of deceased older adults found that individuals who participated more in daily exercise had higher levels of presynaptic proteins and better synaptic integrity in their brains during old age 26. This is the first time that the positive effects of exercise on synaptic function have been demonstrated in humans. Earlier studies have shown that only 10 min of moderate-intensity running per day can improve mood and cognitive performance 27. Exercise such as running increases blood flow to five cortical areas: l-DLPFC, l-FPA, r-DLPFC, r-VLPF and r-FPA. The stimulation received by these areas plays a very important role in improving mood and cognition. Regardless of the activity, high-frequency exercise invariably increases the cognitive load of older adults. Exercise with high cognitive load enhanced functional connectivity in the superior frontal gyrus and prefrontal cortex and reduced functional connectivity in the middle occipital gyrus and postcentral gyrus at rest, which in turn delayed cognitive decline 28.
On the other hand, the positive effects of exercise were not long-lasting 26. This also demonstrated, to some extent, the importance of maintaining a long-term, healthy exercise habit in older adults. Measures such as appropriate physical activity are necessary at any age. As Dr. Robert S. Wilson said 29, "It's never too late to start participating in such activities. Even in your 80s, cognitive stimulation activities can delay the onset of Alzheimer's disease."

Conclusion
The causal inference method developed in this study, combined with maximum likelihood estimation, addressed the difficulty of causal inference for ordered multi-categorical exposures and compensated for the limitations of traditional MR methods. This provides not only a practical tool but also a different way of developing methodologies in this field. In addition, we uncovered a potential positive causal association between exercise and cognition through a more in-depth analysis based on previous studies. This result further supported the positive effect of exercise on cognitive health in Chinese older adults and added stronger evidence for delaying cognitive decline in older individuals.

Table 3. Results of the horizontal pleiotropy test.